In this research project we examined the genomics of the phylum Bacteroidota and of the phyla found at the Konza Prairie Biological Station, both separately and in relation to one another. We aimed to produce an analysis of Bacteroidota and the Konza site using a range of evolutionary genomics and bioinformatics techniques. Our research used data exclusively from the National Ecological Observatory Network (NEON) and the Integrated Microbial Genomes (IMG) database.
In this paper we explore several facets of computational genomics. We were assigned a specific US location and a specific phylum, and we conducted an in-depth investigation to learn more about both. Throughout this class we have used a variety of investigative techniques to gather information on our phylum. Here we apply what we learned in Evolutionary Genomics & Bioinformatics to produce a comprehensive analysis of the phylum Bacteroidota.
The NEON site we were assigned is the Konza Prairie Biological Station in Kansas, US (NEON). The NEON site page lists the type of location, elevation, NLCD class, plot size, slope aspect, slope gradient, potential sampling modules, domain, and the coordinates of the site (NEON). It also shows the distribution of various animals and resources across the site (NEON). The phylum assigned to our group was Bacteroidota, a phylum of gram-negative bacteria often found in soil, the ocean, animal digestive systems, and other similar microbiomes (Wiki). This phylum consists of 6 classes (Oliphant). Bacteroidota are well known for their symbiotic relationships within the gastrointestinal tract of animals. We will use the metagenome assembly data together with the site and phylum information to analyze patterns and features of Bacteroidota. (Oliphant et al. n.d.; Mason et al. 2023; Shibata and Nakane 2023; Schoch et al. 2020; “Konza Prairie Agroecosystem NEON NSF NEON Open Data to Understand Our Ecosystems” n.d.; “Bacteroidota” 2024; n.d.a; n.d.b)
The main data used for this report came from NEON and IMG. NEON collects its data using several techniques (n.d.a; n.d.b): airborne remote sensing, automated instruments, and observational sampling. Airborne remote sensing is done with NEON's Airborne Observation Platform, in which a light aircraft gathers high-resolution remote sensing data from a low altitude. The automated instruments that NEON uses are controlled and operated by the Instrumented System, which is divided into two parts: the Aquatic Instrumented System and the Terrestrial Instrumented System. These systems collect meteorological, soil, phenological, surface water, and groundwater data. Observational sampling is carried out by the Observational System, in which NEON field scientists collect observations and samples from different sites, both aquatic and terrestrial, at different times of year. These field scientists are split into the Aquatic Observational System and the Terrestrial Observational System, similar to how the Instrumented System is set up. Once NEON's samples were collected, NEON analyzed and organized the data and uploaded it to IMG, where our group was able to access and manipulate it (n.d.b). We filtered the data to show results for our specific phylum, Bacteroidota, and for our specific site, the Konza Prairie Biological Station in Kansas, US. Many different resources were used to manipulate and organize the data, including the NEON website, the IMG website, Sankey plots, Posit Cloud, R Markdown, Zotero, Pavian, and a number of filtering methods in Posit Cloud and R Markdown. All of these resources were used to write the code that produced the specific data needed. The code is shown below.
library(tidyverse)
library(plotly)
library(ggtree)
library(TDbook) #A Companion Package for the Book "Data Integration, Manipulation and Visualization of Phylogenetic Trees" by Guangchuang Yu (2022, ISBN:9781032233574).
library(ggimage)
library(rphylopic)
library(treeio)
library(tidytree)
library(ape)
library(TreeTools)
library(phytools)
library(ggnewscale) # provides new_scale_fill(), used in the tree figure
library(ggtreeExtra)
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON_2024_4_21.csv") %>%
  # remove columns that are not needed for data analysis
  select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`, `Bin Lineage`)) %>%
  # create a new column with the Assembly Type
  mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
                                     TRUE ~ "Individual")) %>%
  mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>%
  # strip the rank prefixes (d__, p__, ...) from the GTDB-Tk lineage
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "d__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "p__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "c__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "o__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "f__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "g__", "") %>%
  mutate_at("GTDB-Tk Taxonomy Lineage", str_replace, "s__", "") %>%
  # split the lineage into one column per taxonomic rank
  separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"), ";", remove = FALSE) %>%
  # treat empty ranks as missing values
  mutate_at("Domain", na_if, "") %>%
  mutate_at("Phylum", na_if, "") %>%
  mutate_at("Class", na_if, "") %>%
  mutate_at("Order", na_if, "") %>%
  mutate_at("Family", na_if, "") %>%
  mutate_at("Genus", na_if, "") %>%
  mutate_at("Species", na_if, "") %>%
  # Get rid of the common string "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
  # Use the first ` - ` to split the column in two
  separate(`Genome Name`, c("Site", "Sample Name"), " - ") %>%
  # Get rid of the common string "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID", "subplot.layer.date"), "_", remove = FALSE) %>%
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Rows: 1754 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): Bin ID, Genome Name, Bin Quality, Bin Lineage, GTDB-Tk Taxonomy L...
## dbl (10): IMG Genome ID, Bin Completeness, Bin Contamination, Total Number ...
## date (1): Date Added
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 624 rows [1131, 1132,
## 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145,
## 1146, 1147, 1148, 1149, 1150, ...].
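The warning above is what `separate()` reports when a value contains no separator to split on. A minimal sketch on made-up genome names (the site and sample strings below are hypothetical and not taken from the NEON file) illustrates the behavior: a name such as "NEON combined assembly" contains no " - ", so its `Sample Name` is filled with `NA`.

```r
library(tidyr)

# Hypothetical genome names: one individual sample and one combined assembly
toy <- data.frame(
  `Genome Name` = c("Example Site, USA - EXMP_001-M-20200101",
                    "NEON combined assembly"),
  check.names = FALSE
)

# fill = "right" pads missing pieces with NA on the right without a warning
toy_split <- separate(toy, `Genome Name`, c("Site", "Sample Name"),
                      sep = " - ", fill = "right")
print(toy_split)
```

Under this reading, the 624 affected rows would be the combined-assembly MAGs, which carry no sample name; that interpretation is an assumption and has not been checked against the data.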
NEON_metagenomes <- read_tsv("data/NEON/exported_img_data_Gs0161344_NEON.tsv") %>%
  select(-c(`Domain`, `Sequencing Status`, `Sequencing Center`)) %>%
  rename(`Genome Name` = `Genome Name / Sample Name`) %>%
  # drop re-annotated entries and WREF plot entries
  filter(str_detect(`Genome Name`, 're-annotation', negate = TRUE)) %>%
  filter(str_detect(`Genome Name`, 'WREF plot', negate = TRUE))
## Rows: 176 Columns: 46
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (18): Domain, Sequencing Status, Study Name, Genome Name / Sample Name, ...
## dbl (16): taxon_oid, IMG Genome ID, Depth In Meters, Elevation In Meters, Ge...
## lgl (12): Altitude In Meters, Chlorophyll Concentration, Longhurst Code, Lon...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
NEON_metagenomes <- NEON_metagenomes %>%
  # Get rid of the common string "Terrestrial soil microbial communities from "
  mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
  # Use the first ` - ` to split the column in two
  separate(`Genome Name`, c("Site", "Sample Name"), " - ") %>%
  # Get rid of the common string "-comp-1"
  mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
  # separate the Sample Name into Site ID and plot info
  separate(`Sample Name`, c("Site ID", "subplot.layer.date"), "_", remove = FALSE) %>%
  # separate the plot info into 3 columns
  separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [53].
NEON_chemistry <- read_tsv("data/NEON/neon_plot_soilChem1_metadata.tsv") %>%
  # remove -COMP from genomicsSampleID
  mutate_at("genomicsSampleID", str_replace, "-COMP", "")
## Rows: 87 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): genomicsSampleID, siteID, plotID, nlcdClass, horizon
## dbl (11): decimalLatitude, decimalLongitude, elevation, soilTemp, d15N, org...
## date (1): collectionDate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# join the MAG table to the metagenome metadata and the soil chemistry
NEON_MAGs_metagenomes_chemistry <- NEON_MAGs %>%
  left_join(NEON_metagenomes, by = "Sample Name") %>%
  left_join(NEON_chemistry, by = c("Sample Name" = "genomicsSampleID")) %>%
  rename("label" = "Bin ID")

# copy of the joined table with space-free column names
NEON_MAGs_metagenomes_chemistry_noblank <- NEON_MAGs_metagenomes_chemistry %>%
  rename("AssemblyType" = "Assembly Type") %>%
  rename("BinCompleteness" = "Bin Completeness") %>%
  rename("BinContamination" = "Bin Contamination") %>%
  rename("TotalNumberofBases" = "Total Number of Bases") %>%
  rename("EcosystemSubtype" = "Ecosystem Subtype")

tree_arc <- read.tree("data/NEON/gtdbtk.ar53.decorated.tree")
tree_bac <- read.tree("data/NEON/gtdbtk.bac120.decorated.tree")

# find the tree label that carries the Bacteroidota phylum annotation
node_vector_bac <- c(tree_bac$tip.label, tree_bac$node.label)
grep("Bacteroidota", node_vector_bac, value = TRUE)
## [1] "'1.0:p__Bacteroidota'"
## [1] 2507
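The value 2507 is consistent with the position of the Bacteroidota label in `node_vector_bac`. Because internal nodes are numbered after the tips in an `ape` tree, that position is also the node number. The sketch below uses a hypothetical four-tip tree to show how such a node number can be looked up with `match()` and then used to pull out a clade with `ape::extract.clade()`. It is an illustration only: the code that builds `tree_Bacteroidota` is not shown in this section and may use a different function.

```r
library(ape)

# Hypothetical toy tree with labelled internal nodes
toy_tree <- read.tree(text = "((A:1,B:1)p__Bacteroidota:1,(C:1,D:1)p__Other:1)root;")

# Tips come first, then internal nodes, matching ape's node numbering
node_vector <- c(toy_tree$tip.label, toy_tree$node.label)
bact_node <- match("p__Bacteroidota", node_vector)

# Extract the subtree rooted at that node
toy_bact <- extract.clade(toy_tree, bact_node)
print(toy_bact$tip.label)
```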
Figure 1 shows the individual assemblies for the phylum Bacteroidota, with the taxonomy displayed as a Sankey plot. There are two main classes of Bacteroidota: Bacteroidia, which is the largest, and Ignavibacteria.
Figure 2 shows the combined assembly for the phylum Bacteroidota, again as a Sankey plot. The same two main classes appear: Bacteroidia, which is the largest, and Ignavibacteria. The orders seem to be less distinct from one another in the combined assembly.
Figure 3 shows the Sankey plot for the Konza site. There are 9 phyla, the largest being Actinobacteria. The taxa are spread across many groups, which shows the diversity of Konza.
ggtree(tree_Bacteroidota, layout="circular", branch.length="none") %<+%
  NEON_MAGs_metagenomes_chemistry +
  geom_point2(mapping=aes(color=`Ecosystem Subtype`, size=`Total Number of Bases`)) +
  new_scale_fill() +
  geom_fruit(
    data=NEON_MAGs_metagenomes_chemistry_noblank,
    geom=geom_tile,
    mapping=aes(y=label, x=1, fill= AssemblyType),
    offset=0.08, # The distance between external layers, default is 0.03 times of x range of tree.
    pwidth=0.25 # width of the external layer, default is 0.2 times of x range of tree.
  )
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point_g_gtree()`).
Figure 4 is a phylogenetic tree that includes Bacteroidota only. This tree shows the lineages of Bacteroidota from their earliest point in evolution. Each lineage has between 2e+06 and 3e+06 (2 to 3 million) total bases in its genome. The tree also shows the ecosystem where each lineage is found and whether it comes from an individual assembly or a combined assembly.
ggplotly(
  ggplot(data = NEON_Bact, aes(x = organicCPercent, y = nitrogenPercent)) +
    geom_point(aes(color = Family)) +
    labs(title = "Nitrogen % vs. Organic C %")
)
Figure 5 displays three families of Bacteroidota: Chitinophagaceae, Ignavibacteriaceae, and vadinHA17. Chitinophagaceae appears at four different combinations of nitrogen and carbon percentage, whereas the others each appear only once. Chitinophagaceae was found at lower nitrogen percentages than the other families on the graph. Ignavibacteriaceae appeared at high carbon and medium nitrogen percentages, while vadinHA17 appeared at high percentages of both. Nitrogen percentages in Figure 5 were also much lower than carbon percentages overall: nitrogen ranged from 1-2%, while carbon ranged from 10-40%. The remaining families, other than Cytophagaceae, OLB5, and Sphingobacteriaceae, are found at the same level as Ignavibacteriaceae.
ggplotly(
  ggplot(data = NEON_Bact, aes(x = soilInWaterpH, y = soilInCaClpH)) +
    geom_point(aes(color = Family)) +
    labs(title = "Soil in water pH vs. Soil in CaCl pH")
)
Figure 6 appears to show a linear relationship between pH in water and pH in CaCl, with no obvious cluster of families at one specific pH. Most families are found in acidic soil whether pH is measured in CaCl or in water, with UBA10428 close to neutral pH. Chitinophagaceae does not appear to share pH values with other families; many of its values are very acidic and only a few are closer to neutral. vadinHA17, on the other hand, shares many of its pH values with other families and has 3 values that are more acidic for soil in CaCl than in water. Overall, most families in Figure 6 were found at pH values from about 3 to 6.
ggplotly(
  ggplot(data = NEON_Bact, aes(x = fct_infreq(`Site ID.x`), y = soilTemp)) +
    geom_point(aes(color = Family)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(title = "Relationship between Soil Temp and Site ID")
)
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Figure 7 shows the soil temperature at the sites where Bacteroidota is found. Chitinophagaceae is found across the widest range of temperatures and at most of the sites where Bacteroidota occurs, including TEAK, which has the highest soil temperatures. The other families appear at fewer sites and at fewer temperatures. For example, OLB5 appears only at TOOL, at one specific temperature.
ggplotly(
  ggplot(data = NEON_Bact, aes(x = soilTemp, y = fct_infreq(`Ecosystem Subtype`))) +
    geom_point(aes(color = Family)) +
    labs(title = "Soil temperature vs. Ecosystem")
)
Figure 8 shows soil temperature against ecosystem type for the different families of Bacteroidota. Bacteroidota appears in only four types of ecosystem: taiga, wetlands, tundra, and temperate forest. Chitinophagaceae appears in all four at varying temperatures. vadinHA17 appears only in wetlands and tundra, which differ in temperature by 6 degrees, while the remaining families each appear in one ecosystem at one temperature.
ggplotly(
  ggplot(data = NEON_Bact, aes(x = Family, y = fct_infreq(`Ecosystem Subtype`))) +
    geom_point(aes(color = Family)) +
    labs(title = "Bacteroidota families by ecosystem")
)
Figure 9 shows the ecosystem types in which the different families of Bacteroidota are found. Bacteroidota appears in only 5 types of ecosystem: taiga, wetlands, tundra, temperate forest, and shrubland. Chitinophagaceae appears in all of them. The rest appear in two different ecosystems, except for UBA10428, which appears only in wetlands. The biome with the most family-level diversity is shrubland, with five of the families present, while wetlands is second with four families.
ggplotly(
  ggplot(data = NEON_Bact, aes(x = CNratio, y = Family)) +
    geom_point(aes(color = Family)) +
    coord_flip() +
    labs(title = "CN ratio vs. family")
)
Figure 10 shows the carbon-to-nitrogen ratio in relation to the different families of Bacteroidota. Only one point appears on this graph, for Ignavibacteriaceae, so it is the only Bacteroidota family with a carbon-to-nitrogen ratio value in the data.
ggplotly(
  ggplot(data = NEON_full, aes(x = organicCPercent, y = nitrogenPercent)) +
    geom_point(aes(color = Phylum)) +
    labs(title = "Nitrogen % vs. Organic C %")
)
Figure 11 includes many different phyla. Most of them are concentrated in the region with less carbon and less nitrogen. Pseudomonadota has the largest range, as it is found both at high nitrogen and carbon percentages and at low nitrogen and carbon percentages. Many phyla overlap at the same values on this graph; for example, Verrucomicrobiota is found at 36.86% carbon and 1.36% nitrogen along with others. Bacteroidota is found across a wide range, from low to medium nitrogen (between 1% and 2%) and at carbon percentages up to 40%.
ggplotly(
  ggplot(data = NEON_full, aes(x = soilInWaterpH, y = soilInCaClpH)) +
    geom_point(aes(color = Phylum)) +
    labs(title = "Soil in water pH vs. soil in CaCl pH")
)
Figure 12 shows the different phyla of bacteria at different levels of pH in water and in CaCl. The graph shows a linear relationship between pH in CaCl and pH in water. The phyla as a whole were much more widely distributed than Bacteroidota alone, as no single pH had significantly fewer or more data points. The pH reached 8 at the upper end but was still around 3 at the lower end. Soil in CaCl was more acidic than soil in water.
ggplotly(
  ggplot(data = NEON_full, aes(x = CNratio, y = Phylum)) +
    geom_point(aes(color = Phylum)) +
    coord_flip() +
    labs(title = "CN ratio vs. Phylum")
)
Figure 13 shows the carbon-to-nitrogen ratio and how it affects each phylum. Most phyla appear to do better when the carbon-to-nitrogen ratio is higher. However, phyla such as Acidobacteriota, Actinomycetota, Pseudomonadota, and Verrucomicrobiota do well across a wide range of carbon-to-nitrogen ratios, from 22.73 to 36.13.
ggplotly(
  ggplot(data = NEON_full, aes(x = fct_infreq(`Site ID.x`), y = soilTemp)) +
    geom_point(aes(color = Phylum)) +
    geom_boxplot() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(title = "Temperature in relation to phylum distribution")
)
## Warning: Removed 625 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Figure 14 is a boxplot that shows the soil temperature range at each site. Each phylum is plotted at the soil temperatures at which it was found at each site.
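One plausible reason for the rows removed in the warning above is that MAGs without a matching sample in the soil chemistry table receive `NA` for `soilTemp` after the left join, and `stat_boxplot()` then drops them. The sketch below reproduces that mechanism on hypothetical data (the bin labels, sample names, and temperature are made up); it does not confirm that this is the cause in the real table.

```r
library(dplyr)

# Hypothetical MAG table: bin_2 has no sample name, bin_3 has no chemistry record
toy_mags <- data.frame(
  label = c("bin_1", "bin_2", "bin_3"),
  `Sample Name` = c("SITE_001-M-20200101", NA, "SITE_002-M-20200102"),
  check.names = FALSE
)

# Hypothetical soil chemistry table with a single sampled plot
toy_chem <- data.frame(
  genomicsSampleID = "SITE_001-M-20200101",
  soilTemp = 21.5
)

# A left join keeps every MAG; unmatched MAGs get NA in the chemistry columns
joined <- left_join(toy_mags, toy_chem,
                    by = c("Sample Name" = "genomicsSampleID"))
n_missing <- sum(is.na(joined$soilTemp))
print(joined)
```

Counting `NA` values this way before plotting gives a quick check on how many MAGs a figure silently leaves out.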
ggplotly(
  ggplot(data = NEON_full, aes(x = soilTemp, y = fct_infreq(`Ecosystem Subtype`))) +
    geom_point(aes(color = Phylum)) +
    labs(title = "Soil temperature vs. Ecosystem")
)
Figure 15 shows the soil temperature in each ecosystem where the phyla are found. Bacteroidota are not present in grasslands, but they are present in the tundra, which has lower temperatures than grasslands. This suggests that Bacteroidota may favor colder temperatures over warmer ones.
ggplotly(
  ggplot(data = NEON_full, aes(x = Phylum, y = fct_infreq(`Ecosystem Subtype`))) +
    geom_point(aes(color = Phylum)) +
    labs(title = "Phylum Locations")
)
Figure 16 shows the location of the phyla in their specific ecosystems. Most phyla are present in almost every ecosystem. Actinomycetota is present at every location, while Bacteroidota is present in only a few.
ggplotly(
  ggplot(data = Site_full, aes(x = soilInWaterpH, y = soilInCaClpH)) +
    geom_point(aes(color = Phylum)) +
    labs(title = "Soil in Water pH vs. Soil in CaCl pH")
)
Figure 17 shows a linear relationship between the pH of soil in CaCl and the pH of soil in water. The phyla Actinomycetota, Thermoproteota, Pseudomonadota, Methylomirabilota, and Acidobacteriota span a wide range of pH, from about 6 to 7. The pH of soil in CaCl is lower than the pH of soil in water by around 1. The greatest diversity of phyla occurs at a pH of around 7.4 in water and 6.78 in CaCl. The phylum Myxococcota appears only at around pH 6.88 in water and pH 6.2 in CaCl. There are far fewer phyla in this graph than in the graph for all sites. This figure also does not reach the extremes, with a maximum below pH 8 and a minimum above pH 6.
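The linear relationship and the roughly constant offset described for Figure 17 could be quantified with a simple linear model. The sketch below uses made-up paired pH values (not the Konza measurements) and assumes the same column names as the plotting code; it is one possible check, not part of the original analysis.

```r
# Hypothetical paired pH measurements, for illustration only
toy_ph <- data.frame(
  soilInWaterpH = c(6.2, 6.6, 6.9, 7.1, 7.4),
  soilInCaClpH  = c(5.6, 6.1, 6.2, 6.5, 6.8)
)

# A slope near 1 means the two measurements rise together
fit <- lm(soilInCaClpH ~ soilInWaterpH, data = toy_ph)
slope <- unname(coef(fit)["soilInWaterpH"])

# Average amount by which the CaCl reading is lower than the water reading
mean_offset <- mean(toy_ph$soilInWaterpH - toy_ph$soilInCaClpH)

print(c(slope = slope, mean_offset = mean_offset))
```

Applied to `Site_full`, the same two numbers would show whether the "around 1" offset read from the plot holds across all samples.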
ggplotly(
  ggplot(data = Site_full, aes(x = soilTemp, y = fct_infreq(`Ecosystem Subtype`))) +
    geom_point(aes(color = Phylum)) +
    labs(title = "Soil temperature vs. Ecosystem")
)
Figure 18 shows that Konza has only one ecosystem, grasslands. The phyla Pseudomonadota, Methylomirabilota, Acidobacteriota, and Actinomycetota are found across a wide range of soil temperatures, from 20-24 degrees. The temperature of approximately 24 degrees has the greatest diversity, with 8 of the 10 phyla from Konza found at that temperature. Thermoproteota is found only at soil temperatures of 20-21 degrees. Myxococcota is found at only one soil temperature, approximately 21 degrees.
NEON_MAGs_bact_ind %>%
  # keep MAGs that are unclassified at any rank from domain to genus
  filter(is.na(Class) | is.na(Order) | is.na(Domain) | is.na(Phylum) | is.na(Family) | is.na(Genus)) %>%
  ggplot(aes(x = fct_infreq(`Site`))) +
  geom_bar() +
  coord_flip() +
  labs(title = 'Novel Bacteria by Site')
Figure 19 shows the number of novel bacteria at each site. National Grasslands LBJ has the most novel bacteria, at around 50, while our site, Konza, has approximately 27. This is far more than the site with the fewest, Caribou Creek Watershed, which has only around 2 or 3 novel bacteria.
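The filter behind Figure 19 treats a MAG as novel when any rank from domain to genus is `NA`, that is, when GTDB-Tk could not assign it at that rank. The sketch below shows the same idea on hypothetical data (sites and taxa are made up) and returns the counts as a table instead of a bar chart. It uses `dplyr::if_any()` as a compact alternative to chaining `is.na()` calls with `|`.

```r
library(dplyr)

# Hypothetical MAGs; NA marks a rank that could not be assigned
toy_mags <- data.frame(
  Site  = c("Site A", "Site A", "Site B", "Site B", "Site B"),
  Class = c("Bacteroidia", "Bacteroidia", "Bacteroidia", NA, "Ignavibacteria"),
  Genus = c("GenusX", NA, NA, NA, "GenusY")
)

# Keep MAGs with a missing value at any listed rank, then count per site
novel_by_site <- toy_mags %>%
  filter(if_any(c(Class, Genus), is.na)) %>%
  count(Site, name = "novel_mags", sort = TRUE)

print(novel_by_site)
```

A table like this makes the per-site values exact, where the caption currently reads approximate numbers off the bar chart.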
NEON_MAGs_bact_ind %>%
  ggplot(aes(x = fct_rev(fct_infreq(Site)), fill = Phylum)) +
  geom_bar() +
  coord_flip() +
  labs(title = 'Total MAGs at Each Site')
Figure 20 shows the number of MAGs of each phylum at the different sites. Our site, Konza, had around 60 MAGs, while National Grasslands LBJ had the highest number, at almost 200. Santa Rita Experimental Range had the fewest MAGs, at around 30. The phylum Armatimonadota appeared to have the greatest number of MAGs at most of the sites, while our phylum, Bacteroidota, had a very small number of MAGs.
NEON_MAGs_bact_ind2 %>%
  ggplot(aes(x = Phylum, fill = Subplot)) +
  geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
  labs(title = 'MAG Count for each Phylum at the KONZA Site') +
  coord_flip()
Figure 21 shows the MAG counts of each phylum located at the Konza site. There are 9 phyla at our site: Verrucomicrobiota, Tectomicrobia, Pseudomonadota, Myxococcota, Methylomirabilota, Gemmatimonadota, Chloroflexota, Actinomycetota, and Acidobacteriota.
The site that we were assigned is Konza Prairie Bio Station, Kansas. The metagenome data has data for all of the different NEON sites, as well as a plethora of other sets of data that has to do with the site, environment, species, etc. What we do is analyze our site by comparing it to other sites using the MAG data. The phylogenetic tree of Konza displays the high level of diversity at the site (Figure 3). There are many types of bacteria at this site, and the Sankey tree shows the phyla, classes, orders, etc. In figure 1 and 2, they show the individual and combined assemblies of bacteroidota. This data is useful to visualize the phylogenetic breakdown of the phylum. For example, the families shown under bacteroidota are UBA10428, VadinHA17, Chitinophagacae, Cyclobacteriaceae, Cytophagaceae, Sphingobacteriaceae, Ignavibacteriaceae, and OLB5. These families are compared based on various metrics throughout the lab. Figure 5 showcases the high variability in distribution of each of the families of bacteroidota under different carbon and nitrogen soil percentages. VadinHA17 operates well under high percentages of both elements, while Chitinophagaceae functions under a wide range of percentages. Some families are more variable than others, while some likely fit certain niches better than others. This displays the variation in phenotypic diversity. In figure 6, there seems to be no large statistical pattern when it comes to optimal pH in the soil, aside from Ignavibacteriaceae and OLB5, which seem to have some preference for low pH (acidic conditions) Bacteriodota overall as a phylum has shown to not be present at our site, Konza Prairie Bio Station, Kansas, as shown in figure 7. Figure 7 also shows the average soil temperatures of each site and which families of the Bacteroidota phylum are able to live there. For example, VadinHA17 resides at the TOOl and WOOD sites at soil temperatures of 7.83℃, 8.13℃, and 14.60℃. 
The distribution of soil temperature in different ecosystems is shown in Figure 8. Some of the families tend to fare better in certain biomes, while others, like Chitinophagaceae, are rather adaptable and span multiple biomes.
Bacteroidota was, however, found in wetlands, boreal forests, tundra, temperate forests, and shrubland (Figure 9). The Chitinophagaceae family is present in every ecosystem shown, while the other seven families are present in only a few of the listed ecosystems. The data in Figure 10 are inconclusive and do not represent the entire phylum: the figure contains a single data point, for Ignavibacteriaceae, so there is no other family to compare it with in terms of CN ratio.
Like Figure 5, Figure 11 shows carbon versus nitrogen percentages in the soil, but it compares the different phyla to one another. As a whole, Bacteroidota occurs across varied carbon and nitrogen percentages, favoring a higher carbon percentage. Across all phyla, soil pH measured in water and soil pH measured in calcium chloride (CaCl2) solution show a linear relationship over a wide range of values (Figure 12). Compared with Figure 6, which shows Bacteroidota only, the points are more spread out. This suggests that bacteria overall are diverse microorganisms that can adapt to a range of conditions. For the CN ratio across all of the phyla, Figure 13 shows a high CN ratio, which agrees with the pattern in Figure 11 that Bacteroidota tend to live in soils rich in carbon and not so rich in nitrogen.
Figure 14 shows the distribution of the different phyla across the sites where they are found, plotted against soil temperature. Bacteroidota tends to live toward the high end of the 10-20°C soil temperature range, which is about average among all phyla according to the bar graph. The soil temperature in each ecosystem plays a large role in which phyla are able to survive, as some do better in colder temperatures while others do better in warmer ones. Figure 15 shows which phyla are located in each ecosystem and the optimal soil temperature for each phylum in that ecosystem. Figure 16 shows the biomes in which each phylum is found.
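The CN ratio discussed for Figures 10 and 13 can be tied back to the carbon and nitrogen percentages in Figures 5 and 11 with a short worked equation. This is only a sketch: it assumes the ratio is the quotient of the two soil percentages, and the symbols and example values are illustrative rather than taken from the NEON data.

$$
\text{CN ratio} = \frac{C_{\%}}{N_{\%}}, \qquad \text{for example } \frac{5.0}{0.4} = 12.5
$$

Under that assumption, a sample with relatively high carbon and low nitrogen gives a large ratio, which is why a high CN ratio would be consistent with a carbon-rich, nitrogen-poor soil.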
Bacteroidota is found in boreal forests, wetlands, tundra, temperate forests, and shrublands. This wide variety of biomes and conditions shows the persistence of the phylum across a wide range of environments.
Figure 17 shows the relationship between soil pH measured in water and soil pH measured in CaCl2 at Konza for each phylum. The relationship is again linear, with the phylum Thermoproteota found at the highest water pH (7.6) and CaCl2 pH (7.18). The Konza site has only one ecosystem, grassland, and within it the soil temperature does not vary much: most of the phyla at Konza live within the 20-22°C soil temperature range (Figure 18).
Figure 19 shows the number of novel bacteria at each site from the individual assemblies. Konza has the third-highest number of novel bacteria, at roughly 27. However, Figure 20 shows the total MAG count at each site, and even though Konza has a high number of novel bacteria, its total MAG count is comparatively low. This could be explained by genetic similarity between the bacteria at this site and those at other sites. Figure 21 looks at the site in more depth, showing the MAG counts for each individual phylum at Konza.
Throughout this project we examined the genomics of our phylum, Bacteroidota, and our site, Konza Prairie Biological Station. We used data from NEON and IMG and applied a variety of bioinformatic and evolutionary genomic techniques to analyze them. Our research highlighted which bacterial phyla live where, what might affect them, which phyla our site is most suitable for, and what specifically affects Bacteroidota. Our findings show that Bacteroidota can survive in many environment types, which suggests that they adapt readily across different soil temperatures and pH levels. The carbon and nitrogen data also showed that Chitinophagaceae is the most adaptable of the Bacteroidota families. The research also showed that our assigned phylum was not present at the Konza site, while many other phyla, such as Actinomycetota, were. Konza is nonetheless a diverse site, with many other phyla located there. Overall, this project has shown the relationship between microorganisms and their environments. These findings contribute insights into the genomics of Bacteroidota and the Konza site, and they can provide a foundation for future research on microbial diversity.